Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset is a multi-class classification problem in which we try to predict one of seven possible cover types.

INTRODUCTION: This experiment tries to predict forest cover type from cartographic variables only. The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are more a result of ecological processes than of forest management practices.

The actual forest cover type for a given observation (30 x 30-meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
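
To illustrate the binary indicator layout described above, the following minimal sketch collapses a set of 0/1 columns back into a single factor. The three-row `sample_df` is a hypothetical stand-in, not data from the project, and the project itself keeps the indicator columns as they are; this only shows how the encoding can be read.

```r
# Hypothetical three-row sample mimicking the dataset's indicator columns
sample_df <- data.frame(
  Wilderness_Area1 = c(1, 0, 0),
  Wilderness_Area2 = c(0, 0, 1),
  Wilderness_Area3 = c(0, 1, 0),
  Wilderness_Area4 = c(0, 0, 0)
)

# Each row should have exactly one indicator switched on
stopifnot(all(rowSums(sample_df) == 1))

# max.col() returns the index of the largest value in each row,
# i.e. the position of the single 1 in a one-hot encoded row
area_index <- max.col(as.matrix(sample_df), ties.method = "first")
wilderness_area <- factor(area_index, levels = 1:4,
                          labels = paste0("Area", 1:4))
print(wilderness_area)
```

The same approach would apply to the 40 soil type columns, assuming every observation carries exactly one soil type.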

In iteration Take1, we established the baseline accuracy for comparison with future rounds of modeling.

In iteration Take2, we examined the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting the most important attributes, we decreased the modeling time and still maintained a similar level of accuracy when compared to the baseline model.

In iteration Take3, we will examine the feature selection technique of recursive feature elimination (RFE) by using the Random Forest algorithm. By selecting no more than 30 attributes, we hope to maintain a similar level of accuracy when compared to the baseline model.
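
As a rough illustration of recursive feature elimination with Random Forest, the sketch below uses caret's `rfe()` with the built-in `rfFuncs` helper. It runs on the `iris` dataset as a small stand-in (an assumption made for brevity); the subset sizes and the 5-fold cross-validation are illustrative choices and not necessarily the settings used in this project.

```r
library(caret)          # provides rfe(), rfeControl(), rfFuncs, predictors()
library(randomForest)   # backend used by rfFuncs

set.seed(888)

# iris is a stand-in dataset (assumption); the project uses the Covertype data
x <- iris[, 1:4]
y <- iris$Species

# 5-fold cross-validation with Random Forest as the ranking/fitting function
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

# Evaluate subsets of 1, 2 and 3 predictors (the full set is always included)
rfe_result <- rfe(x, y, sizes = 1:3, rfeControl = ctrl)

print(rfe_result)
selected <- predictors(rfe_result)   # names of the retained attributes
print(selected)
```

For a cap of 30 attributes on a 54-attribute dataset, the `sizes` argument could be set to something like `1:30`, at the cost of a longer run.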

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 78.04%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 85.48%. By using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 86.07%, which was even better than the predictions from the training data.

From iteration Take2, the machine learning algorithms achieved an average accuracy of 74.27%. Random Forest achieved an accuracy of 85.47% with the training data and 85.85% on the testing dataset, slightly better than on the training data. At the importance level of 99%, the attribute importance technique eliminated 22 of 54 total attributes. The remaining 32 attributes produced a model with accuracy comparable to the baseline model. The modeling time went from 1 hour 19 minutes down to 58 minutes, a reduction of about 26.6%.
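
One plausible reading of the "importance level of 99%" is a cumulative-importance cutoff: rank the attributes, then keep the smallest leading set whose combined importance reaches 99% of the total. The sketch below shows that idea with made-up scores. The `importance` vector is hypothetical, and the exact rule used in Take2 may differ.

```r
# Hypothetical importance scores (assumption); in practice these might come
# from caret::varImp() on a fitted Gradient Boosting model
importance <- c(Elevation = 55, Soil_Type10 = 0.2, Aspect = 12,
                Slope = 8.5, Hillshade_9am = 24, Soil_Type7 = 0.3)

# Rank attributes from most to least important
ranked <- sort(importance, decreasing = TRUE)

# Running share of the total importance
cum_share <- cumsum(ranked) / sum(ranked)

# Keep the smallest leading set whose cumulative share reaches 99%
n_keep <- which(cum_share >= 0.99)[1]
keep <- names(ranked)[seq_len(n_keep)]
drop <- setdiff(names(ranked), keep)

print(keep)
print(drop)
```

With these made-up scores, the four strongest attributes are kept and the two soil type indicators are dropped.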

From the current iteration (Take3), the machine learning algorithms achieved an average accuracy of 73.25%. Random Forest achieved an accuracy of 84.24% with the training data and 84.77% on the testing dataset, slightly better than on the training data. The RFE technique eliminated 42 of 54 total attributes. The remaining 12 attributes produced a model with accuracy comparable to the baseline model (about 1.3 percentage points lower on the testing dataset). The modeling time went from 1 hour 19 minutes down to 33 minutes, a reduction of 58.2%.
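
The time reduction quoted for this iteration can be checked with a short calculation. The sketch below measures the saving relative to the baseline time; `pct_reduction` is a hypothetical helper name introduced only for this illustration.

```r
# Modeling times in minutes, taken from the text above
baseline_minutes <- 1 * 60 + 19   # 1 hour 19 minutes
rfe_minutes <- 33

# Percentage reduction relative to the starting (baseline) time
pct_reduction <- function(before, after) {
  (before - after) / before * 100
}

reduction <- pct_reduction(baseline_minutes, rfe_minutes)
cat(sprintf("Reduction: %.1f%%\n", reduction))   # prints "Reduction: 58.2%"
```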

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.

Dataset Used: Covertype Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype

One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project can generally be broken down into about six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(mailR)
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(stringr)

# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)

1.b) Set up the email notification function

email_notify <- function(msg=""){
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Multi-Class Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
# Set the notifyStatus flag to control progress emails (setting TRUE will send emails!)
notifyStatus <- FALSE
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))
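
Because `email_notify()` reads all of its settings from environment variables, it can help to confirm they are set before turning notifications on. The following is a minimal sketch of such a check; `missing_env_vars` is a hypothetical helper and is not part of the original script.

```r
# Names of the environment variables read by email_notify()
required_vars <- c("MAIL_SENDER", "MAIL_RECEIVER", "SMTP_GATEWAY",
                   "SMTP_USERNAME", "SMTP_PASSWORD")

# Hypothetical helper: return the names of variables that are unset or empty
missing_env_vars <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

unset_vars <- missing_env_vars(required_vars)
if (length(unset_vars) > 0) {
  message("Email notification should stay off; missing: ",
          paste(unset_vars, collapse = ", "))
}
```

A script could use the result to force `notifyStatus` to `FALSE` when any setting is missing, rather than failing partway through a long modeling run.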

1.c) Load dataset

# Slicing up the document path to get the final destination file name
dataset_path <- 'https://www.kaggle.com/c/forest-cover-type-prediction/download/train.csv'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]

if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
#  unzip(dest_file)
#  cat(dest_file, "unpacked!\n")
}

inputFile <- dest_file
Xy_original <- read.csv(inputFile, sep=',', header=TRUE, row.names=1)
Xy_original$Cover_Type <- as.factor(Xy_original$Cover_Type)
# Take a peek at the dataframe after the import
head(Xy_original)
##   Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1      2596     51     3                              258
## 2      2590     56     2                              212
## 3      2804    139     9                              268
## 4      2785    155    18                              242
## 5      2595     45     2                              153
## 6      2579    132     6                              300
##   Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1                              0                             510
## 2                             -6                             390
## 3                             65                            3180
## 4                            118                            3090
## 5                             -1                             391
## 6                            -15                              67
##   Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1           221            232           148
## 2           220            235           151
## 3           234            238           135
## 4           238            238           122
## 5           220            234           150
## 6           230            237           140
##   Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1                               6279                1                0
## 2                               6225                1                0
## 3                               6121                1                0
## 4                               6211                1                0
## 5                               6172                1                0
## 6                               6031                1                0
##   Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1                0                0          0          0          0
## 2                0                0          0          0          0
## 3                0                0          0          0          0
## 4                0                0          0          0          0
## 5                0                0          0          0          0
## 6                0                0          0          0          0
##   Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1          0          0          0          0          0          0
## 2          0          0          0          0          0          0
## 3          0          0          0          0          0          0
## 4          0          0          0          0          0          0
## 5          0          0          0          0          0          0
## 6          0          0          0          0          0          0
##   Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           1           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1           0           1           0           0           0           0
## 2           0           1           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           1           0           0           0
## 5           0           1           0           0           0           0
## 6           0           1           0           0           0           0
##   Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type40 Cover_Type
## 1           0          5
## 2           0          5
## 3           0          2
## 4           0          2
## 5           0          5
## 6           0          2
sapply(Xy_original, class)
##                          Elevation                             Aspect 
##                          "integer"                          "integer" 
##                              Slope   Horizontal_Distance_To_Hydrology 
##                          "integer"                          "integer" 
##     Vertical_Distance_To_Hydrology    Horizontal_Distance_To_Roadways 
##                          "integer"                          "integer" 
##                      Hillshade_9am                     Hillshade_Noon 
##                          "integer"                          "integer" 
##                      Hillshade_3pm Horizontal_Distance_To_Fire_Points 
##                          "integer"                          "integer" 
##                   Wilderness_Area1                   Wilderness_Area2 
##                          "integer"                          "integer" 
##                   Wilderness_Area3                   Wilderness_Area4 
##                          "integer"                          "integer" 
##                         Soil_Type1                         Soil_Type2 
##                          "integer"                          "integer" 
##                         Soil_Type3                         Soil_Type4 
##                          "integer"                          "integer" 
##                         Soil_Type5                         Soil_Type6 
##                          "integer"                          "integer" 
##                         Soil_Type7                         Soil_Type8 
##                          "integer"                          "integer" 
##                         Soil_Type9                        Soil_Type10 
##                          "integer"                          "integer" 
##                        Soil_Type11                        Soil_Type12 
##                          "integer"                          "integer" 
##                        Soil_Type13                        Soil_Type14 
##                          "integer"                          "integer" 
##                        Soil_Type15                        Soil_Type16 
##                          "integer"                          "integer" 
##                        Soil_Type17                        Soil_Type18 
##                          "integer"                          "integer" 
##                        Soil_Type19                        Soil_Type20 
##                          "integer"                          "integer" 
##                        Soil_Type21                        Soil_Type22 
##                          "integer"                          "integer" 
##                        Soil_Type23                        Soil_Type24 
##                          "integer"                          "integer" 
##                        Soil_Type25                        Soil_Type26 
##                          "integer"                          "integer" 
##                        Soil_Type27                        Soil_Type28 
##                          "integer"                          "integer" 
##                        Soil_Type29                        Soil_Type30 
##                          "integer"                          "integer" 
##                        Soil_Type31                        Soil_Type32 
##                          "integer"                          "integer" 
##                        Soil_Type33                        Soil_Type34 
##                          "integer"                          "integer" 
##                        Soil_Type35                        Soil_Type36 
##                          "integer"                          "integer" 
##                        Soil_Type37                        Soil_Type38 
##                          "integer"                          "integer" 
##                        Soil_Type39                        Soil_Type40 
##                          "integer"                          "integer" 
##                         Cover_Type 
##                           "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
##                          Elevation                             Aspect 
##                                  0                                  0 
##                              Slope   Horizontal_Distance_To_Hydrology 
##                                  0                                  0 
##     Vertical_Distance_To_Hydrology    Horizontal_Distance_To_Roadways 
##                                  0                                  0 
##                      Hillshade_9am                     Hillshade_Noon 
##                                  0                                  0 
##                      Hillshade_3pm Horizontal_Distance_To_Fire_Points 
##                                  0                                  0 
##                   Wilderness_Area1                   Wilderness_Area2 
##                                  0                                  0 
##                   Wilderness_Area3                   Wilderness_Area4 
##                                  0                                  0 
##                         Soil_Type1                         Soil_Type2 
##                                  0                                  0 
##                         Soil_Type3                         Soil_Type4 
##                                  0                                  0 
##                         Soil_Type5                         Soil_Type6 
##                                  0                                  0 
##                         Soil_Type7                         Soil_Type8 
##                                  0                                  0 
##                         Soil_Type9                        Soil_Type10 
##                                  0                                  0 
##                        Soil_Type11                        Soil_Type12 
##                                  0                                  0 
##                        Soil_Type13                        Soil_Type14 
##                                  0                                  0 
##                        Soil_Type15                        Soil_Type16 
##                                  0                                  0 
##                        Soil_Type17                        Soil_Type18 
##                                  0                                  0 
##                        Soil_Type19                        Soil_Type20 
##                                  0                                  0 
##                        Soil_Type21                        Soil_Type22 
##                                  0                                  0 
##                        Soil_Type23                        Soil_Type24 
##                                  0                                  0 
##                        Soil_Type25                        Soil_Type26 
##                                  0                                  0 
##                        Soil_Type27                        Soil_Type28 
##                                  0                                  0 
##                        Soil_Type29                        Soil_Type30 
##                                  0                                  0 
##                        Soil_Type31                        Soil_Type32 
##                                  0                                  0 
##                        Soil_Type33                        Soil_Type34 
##                                  0                                  0 
##                        Soil_Type35                        Soil_Type36 
##                                  0                                  0 
##                        Soil_Type37                        Soil_Type38 
##                                  0                                  0 
##                        Soil_Type39                        Soil_Type40 
##                                  0                                  0 
##                         Cover_Type 
##                                  0
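
With 55 columns, the per-column listings above are long. A more compact check, sketched below, tabulates how many columns fall into each class and reports only the columns that contain missing values. The small `df` is a hypothetical stand-in for `Xy_original` so that the snippet runs on its own.

```r
# Hypothetical stand-in for Xy_original (assumption): a few integer columns
# plus a factor target, mirroring the structure shown above
df <- data.frame(
  Elevation = c(2596L, 2590L, 2804L),
  Soil_Type1 = c(0L, 0L, 0L),
  Cover_Type = factor(c(5, 5, 2))
)

# Count of columns per class instead of one line per column
class_counts <- table(sapply(df, class))
print(class_counts)

# Total missing values, and the names of any columns that contain them
total_na <- sum(is.na(df))
cols_with_na <- names(which(colSums(is.na(df)) > 0))
cat("Total NA values:", total_na, "\n")
cat("Columns with NA:", length(cols_with_na), "\n")
```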

1.d) Data Cleaning

# Not applicable for this iteration of the project.
# Take a peek at the dataframe after the cleaning
head(Xy_original)
##   Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1      2596     51     3                              258
## 2      2590     56     2                              212
## 3      2804    139     9                              268
## 4      2785    155    18                              242
## 5      2595     45     2                              153
## 6      2579    132     6                              300
##   Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1                              0                             510
## 2                             -6                             390
## 3                             65                            3180
## 4                            118                            3090
## 5                             -1                             391
## 6                            -15                              67
##   Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1           221            232           148
## 2           220            235           151
## 3           234            238           135
## 4           238            238           122
## 5           220            234           150
## 6           230            237           140
##   Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1                               6279                1                0
## 2                               6225                1                0
## 3                               6121                1                0
## 4                               6211                1                0
## 5                               6172                1                0
## 6                               6031                1                0
##   Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1                0                0          0          0          0
## 2                0                0          0          0          0
## 3                0                0          0          0          0
## 4                0                0          0          0          0
## 5                0                0          0          0          0
## 6                0                0          0          0          0
##   Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1          0          0          0          0          0          0
## 2          0          0          0          0          0          0
## 3          0          0          0          0          0          0
## 4          0          0          0          0          0          0
## 5          0          0          0          0          0          0
## 6          0          0          0          0          0          0
##   Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           1           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1           0           1           0           0           0           0
## 2           0           1           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           1           0           0           0
## 5           0           1           0           0           0           0
## 6           0           1           0           0           0           0
##   Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1           0           0           0           0           0           0
## 2           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
##   Soil_Type40 Cover_Type
## 1           0          5
## 2           0          5
## 3           0          2
## 4           0          2
## 5           0          5
## 6           0          2
sapply(Xy_original, class)
##                          Elevation                             Aspect 
##                          "integer"                          "integer" 
##                              Slope   Horizontal_Distance_To_Hydrology 
##                          "integer"                          "integer" 
##     Vertical_Distance_To_Hydrology    Horizontal_Distance_To_Roadways 
##                          "integer"                          "integer" 
##                      Hillshade_9am                     Hillshade_Noon 
##                          "integer"                          "integer" 
##                      Hillshade_3pm Horizontal_Distance_To_Fire_Points 
##                          "integer"                          "integer" 
##                   Wilderness_Area1                   Wilderness_Area2 
##                          "integer"                          "integer" 
##                   Wilderness_Area3                   Wilderness_Area4 
##                          "integer"                          "integer" 
##                         Soil_Type1                         Soil_Type2 
##                          "integer"                          "integer" 
##                         Soil_Type3                         Soil_Type4 
##                          "integer"                          "integer" 
##                         Soil_Type5                         Soil_Type6 
##                          "integer"                          "integer" 
##                         Soil_Type7                         Soil_Type8 
##                          "integer"                          "integer" 
##                         Soil_Type9                        Soil_Type10 
##                          "integer"                          "integer" 
##                        Soil_Type11                        Soil_Type12 
##                          "integer"                          "integer" 
##                        Soil_Type13                        Soil_Type14 
##                          "integer"                          "integer" 
##                        Soil_Type15                        Soil_Type16 
##                          "integer"                          "integer" 
##                        Soil_Type17                        Soil_Type18 
##                          "integer"                          "integer" 
##                        Soil_Type19                        Soil_Type20 
##                          "integer"                          "integer" 
##                        Soil_Type21                        Soil_Type22 
##                          "integer"                          "integer" 
##                        Soil_Type23                        Soil_Type24 
##                          "integer"                          "integer" 
##                        Soil_Type25                        Soil_Type26 
##                          "integer"                          "integer" 
##                        Soil_Type27                        Soil_Type28 
##                          "integer"                          "integer" 
##                        Soil_Type29                        Soil_Type30 
##                          "integer"                          "integer" 
##                        Soil_Type31                        Soil_Type32 
##                          "integer"                          "integer" 
##                        Soil_Type33                        Soil_Type34 
##                          "integer"                          "integer" 
##                        Soil_Type35                        Soil_Type36 
##                          "integer"                          "integer" 
##                        Soil_Type37                        Soil_Type38 
##                          "integer"                          "integer" 
##                        Soil_Type39                        Soil_Type40 
##                          "integer"                          "integer" 
##                         Cover_Type 
##                           "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
##                          Elevation                             Aspect 
##                                  0                                  0 
##                              Slope   Horizontal_Distance_To_Hydrology 
##                                  0                                  0 
##     Vertical_Distance_To_Hydrology    Horizontal_Distance_To_Roadways 
##                                  0                                  0 
##                      Hillshade_9am                     Hillshade_Noon 
##                                  0                                  0 
##                      Hillshade_3pm Horizontal_Distance_To_Fire_Points 
##                                  0                                  0 
##                   Wilderness_Area1                   Wilderness_Area2 
##                                  0                                  0 
##                   Wilderness_Area3                   Wilderness_Area4 
##                                  0                                  0 
##                         Soil_Type1                         Soil_Type2 
##                                  0                                  0 
##                         Soil_Type3                         Soil_Type4 
##                                  0                                  0 
##                         Soil_Type5                         Soil_Type6 
##                                  0                                  0 
##                         Soil_Type7                         Soil_Type8 
##                                  0                                  0 
##                         Soil_Type9                        Soil_Type10 
##                                  0                                  0 
##                        Soil_Type11                        Soil_Type12 
##                                  0                                  0 
##                        Soil_Type13                        Soil_Type14 
##                                  0                                  0 
##                        Soil_Type15                        Soil_Type16 
##                                  0                                  0 
##                        Soil_Type17                        Soil_Type18 
##                                  0                                  0 
##                        Soil_Type19                        Soil_Type20 
##                                  0                                  0 
##                        Soil_Type21                        Soil_Type22 
##                                  0                                  0 
##                        Soil_Type23                        Soil_Type24 
##                                  0                                  0 
##                        Soil_Type25                        Soil_Type26 
##                                  0                                  0 
##                        Soil_Type27                        Soil_Type28 
##                                  0                                  0 
##                        Soil_Type29                        Soil_Type30 
##                                  0                                  0 
##                        Soil_Type31                        Soil_Type32 
##                                  0                                  0 
##                        Soil_Type33                        Soil_Type34 
##                                  0                                  0 
##                        Soil_Type35                        Soil_Type36 
##                                  0                                  0 
##                        Soil_Type37                        Soil_Type38 
##                                  0                                  0 
##                        Soil_Type39                        Soil_Type40 
##                                  0                                  0 
##                         Cover_Type 
##                                  0

1.e) Splitting Data into Training and Testing Sets

# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, take care when slicing up the dataframes for visualization!
targetCol <- totCol

# Standardize the class column to the name of targetVar if applicable
colnames(Xy_original)[targetCol] <- "targetVar"
# We create training datasets (Xy_train, X_train, y_train) for various visualization and cleaning/transformation operations.
# We create testing datasets (Xy_test, y_test) for various visualization and cleaning/transformation operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
# Use 70% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.70, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]

if (targetCol==1) {
    X_train <- Xy_train[,(targetCol+1):totCol]
    y_train <- Xy_train[,targetCol]
    y_test <- Xy_test[,targetCol]
} else {
    X_train <- Xy_train[,1:(totAttr)]
    y_train <- Xy_train[,totCol]
    y_test <- Xy_test[,totCol]
}
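
The partitioning step relies on caret's createDataPartition, which for a factor outcome samples within each class so that the class proportions carry over to both partitions. As a rough illustration of that idea only, the base R sketch below performs a stratified 70/30 split on a made-up data frame. The names `toy` and `stratified_index` are hypothetical and are not part of the original script or of caret.

```r
# Toy data: 7 balanced classes with 20 rows each (made-up values, not the real dataset)
set.seed(888)
toy <- data.frame(x = rnorm(140), targetVar = factor(rep(1:7, each = 20)))

# Hypothetical helper: pick a fraction p of the row indices within each class
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(toy$targetVar, p = 0.70)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Every class contributes the same number of rows to each partition
print(table(toy_train$targetVar))
print(table(toy_test$targetVar))
```

The balanced class counts seen later in the training-set summary are consistent with this kind of per-class sampling.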

1.f) Set up the key parameters to be used in the script

# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 3
if (totAttr%%dispCol == 0) {
    dispRow <- totAttr%/%dispCol
} else {
    dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  3  by  18
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))
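
The grid-size calculation above rounds the number of rows up so that every attribute gets a panel. A minimal sketch of the same arithmetic using ceiling() is shown here. The value 54 is taken from the 55-column dataset minus the target column, and `rows_needed_55` is an illustrative name only.

```r
# 55 columns in the dataset, minus the target column
totAttr <- 54
dispCol <- 3

# Round up so that dispRow * dispCol >= totAttr
dispRow <- ceiling(totAttr / dispCol)
cat("Grid (col x row):", dispCol, "by", dispRow, "\n")

# When the count is not a multiple of dispCol, ceiling() matches the "+ 1" branch
rows_needed_55 <- ceiling(55 / dispCol)
stopifnot(rows_needed_55 == (55 %/% dispCol) + 1)
```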

2. Summarize Data

To gain a better understanding of the data on hand, we will use a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(Xy_train)
##   Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1      2596     51     3                              258
## 3      2804    139     9                              268
## 4      2785    155    18                              242
## 5      2595     45     2                              153
## 6      2579    132     6                              300
## 7      2606     45     7                              270
##   Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1                              0                             510
## 3                             65                            3180
## 4                            118                            3090
## 5                             -1                             391
## 6                            -15                              67
## 7                              5                             633
##   Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1           221            232           148
## 3           234            238           135
## 4           238            238           122
## 5           220            234           150
## 6           230            237           140
## 7           222            225           138
##   Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1                               6279                1                0
## 3                               6121                1                0
## 4                               6211                1                0
## 5                               6172                1                0
## 6                               6031                1                0
## 7                               6256                1                0
##   Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1                0                0          0          0          0
## 3                0                0          0          0          0
## 4                0                0          0          0          0
## 5                0                0          0          0          0
## 6                0                0          0          0          0
## 7                0                0          0          0          0
##   Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1          0          0          0          0          0          0
## 3          0          0          0          0          0          0
## 4          0          0          0          0          0          0
## 5          0          0          0          0          0          0
## 6          0          0          0          0          0          0
## 7          0          0          0          0          0          0
##   Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1           0           0           0           0           0           0
## 3           0           0           1           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
## 7           0           0           0           0           0           0
##   Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
## 7           0           0           0           0           0           0
##   Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
## 7           0           0           0           0           0           0
##   Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1           0           1           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           1           0           0           0
## 5           0           1           0           0           0           0
## 6           0           1           0           0           0           0
## 7           0           1           0           0           0           0
##   Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1           0           0           0           0           0           0
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           0           0
## 6           0           0           0           0           0           0
## 7           0           0           0           0           0           0
##   Soil_Type40 targetVar
## 1           0         5
## 3           0         2
## 4           0         2
## 5           0         5
## 6           0         2
## 7           0         5

2.a.ii) Dimensions of the dataset.

dim(Xy_train)
## [1] 10584    55

2.a.iii) Types of the attributes.

sapply(Xy_train, class)
##                          Elevation                             Aspect 
##                          "integer"                          "integer" 
##                              Slope   Horizontal_Distance_To_Hydrology 
##                          "integer"                          "integer" 
##     Vertical_Distance_To_Hydrology    Horizontal_Distance_To_Roadways 
##                          "integer"                          "integer" 
##                      Hillshade_9am                     Hillshade_Noon 
##                          "integer"                          "integer" 
##                      Hillshade_3pm Horizontal_Distance_To_Fire_Points 
##                          "integer"                          "integer" 
##                   Wilderness_Area1                   Wilderness_Area2 
##                          "integer"                          "integer" 
##                   Wilderness_Area3                   Wilderness_Area4 
##                          "integer"                          "integer" 
##                         Soil_Type1                         Soil_Type2 
##                          "integer"                          "integer" 
##                         Soil_Type3                         Soil_Type4 
##                          "integer"                          "integer" 
##                         Soil_Type5                         Soil_Type6 
##                          "integer"                          "integer" 
##                         Soil_Type7                         Soil_Type8 
##                          "integer"                          "integer" 
##                         Soil_Type9                        Soil_Type10 
##                          "integer"                          "integer" 
##                        Soil_Type11                        Soil_Type12 
##                          "integer"                          "integer" 
##                        Soil_Type13                        Soil_Type14 
##                          "integer"                          "integer" 
##                        Soil_Type15                        Soil_Type16 
##                          "integer"                          "integer" 
##                        Soil_Type17                        Soil_Type18 
##                          "integer"                          "integer" 
##                        Soil_Type19                        Soil_Type20 
##                          "integer"                          "integer" 
##                        Soil_Type21                        Soil_Type22 
##                          "integer"                          "integer" 
##                        Soil_Type23                        Soil_Type24 
##                          "integer"                          "integer" 
##                        Soil_Type25                        Soil_Type26 
##                          "integer"                          "integer" 
##                        Soil_Type27                        Soil_Type28 
##                          "integer"                          "integer" 
##                        Soil_Type29                        Soil_Type30 
##                          "integer"                          "integer" 
##                        Soil_Type31                        Soil_Type32 
##                          "integer"                          "integer" 
##                        Soil_Type33                        Soil_Type34 
##                          "integer"                          "integer" 
##                        Soil_Type35                        Soil_Type36 
##                          "integer"                          "integer" 
##                        Soil_Type37                        Soil_Type38 
##                          "integer"                          "integer" 
##                        Soil_Type39                        Soil_Type40 
##                          "integer"                          "integer" 
##                          targetVar 
##                           "factor"

2.a.iv) Statistical summary of all attributes.

summary(Xy_train)
##    Elevation        Aspect          Slope      
##  Min.   :1874   Min.   :  0.0   Min.   : 0.00  
##  1st Qu.:2377   1st Qu.: 65.0   1st Qu.:10.00  
##  Median :2751   Median :126.0   Median :15.00  
##  Mean   :2751   Mean   :156.8   Mean   :16.48  
##  3rd Qu.:3108   3rd Qu.:260.0   3rd Qu.:22.00  
##  Max.   :3846   Max.   :360.0   Max.   :50.00  
##                                                
##  Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology
##  Min.   :   0.0                   Min.   :-134.00               
##  1st Qu.:  60.0                   1st Qu.:   4.00               
##  Median : 180.0                   Median :  32.00               
##  Mean   : 226.4                   Mean   :  50.53               
##  3rd Qu.: 323.2                   3rd Qu.:  79.00               
##  Max.   :1343.0                   Max.   : 554.00               
##                                                                 
##  Horizontal_Distance_To_Roadways Hillshade_9am   Hillshade_Noon 
##  Min.   :   0                    Min.   : 58.0   Min.   : 99.0  
##  1st Qu.: 768                    1st Qu.:196.0   1st Qu.:207.0  
##  Median :1317                    Median :220.0   Median :223.0  
##  Mean   :1714                    Mean   :212.7   Mean   :219.1  
##  3rd Qu.:2263                    3rd Qu.:235.0   3rd Qu.:235.0  
##  Max.   :6836                    Max.   :254.0   Max.   :254.0  
##                                                                 
##  Hillshade_3pm   Horizontal_Distance_To_Fire_Points Wilderness_Area1
##  Min.   :  0.0   Min.   :  30                       Min.   :0.0000  
##  1st Qu.:107.0   1st Qu.: 732                       1st Qu.:0.0000  
##  Median :138.0   Median :1256                       Median :0.0000  
##  Mean   :135.2   Mean   :1515                       Mean   :0.2367  
##  3rd Qu.:167.0   3rd Qu.:1992                       3rd Qu.:0.0000  
##  Max.   :247.0   Max.   :6993                       Max.   :1.0000  
##                                                                     
##  Wilderness_Area2  Wilderness_Area3 Wilderness_Area4   Soil_Type1    
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.03231   Mean   :0.4228   Mean   :0.3082   Mean   :0.0239  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                      
##    Soil_Type2        Soil_Type3        Soil_Type4        Soil_Type5     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.04176   Mean   :0.06349   Mean   :0.05678   Mean   :0.01058  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##    Soil_Type6        Soil_Type7   Soil_Type8         Soil_Type9       
##  Min.   :0.00000   Min.   :0    Min.   :0.00e+00   Min.   :0.0000000  
##  1st Qu.:0.00000   1st Qu.:0    1st Qu.:0.00e+00   1st Qu.:0.0000000  
##  Median :0.00000   Median :0    Median :0.00e+00   Median :0.0000000  
##  Mean   :0.04091   Mean   :0    Mean   :9.45e-05   Mean   :0.0006614  
##  3rd Qu.:0.00000   3rd Qu.:0    3rd Qu.:0.00e+00   3rd Qu.:0.0000000  
##  Max.   :1.00000   Max.   :0    Max.   :1.00e+00   Max.   :1.0000000  
##                                                                       
##   Soil_Type10      Soil_Type11       Soil_Type12       Soil_Type13     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.1419   Mean   :0.02731   Mean   :0.01455   Mean   :0.03193  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                        
##   Soil_Type14       Soil_Type15  Soil_Type16        Soil_Type17     
##  Min.   :0.00000   Min.   :0    Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0    1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0    Median :0.000000   Median :0.00000  
##  Mean   :0.01039   Mean   :0    Mean   :0.008031   Mean   :0.04034  
##  3rd Qu.:0.00000   3rd Qu.:0    3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :0    Max.   :1.000000   Max.   :1.00000  
##                                                                     
##   Soil_Type18        Soil_Type19        Soil_Type20      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.000000  
##  Mean   :0.003874   Mean   :0.003212   Mean   :0.009448  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##                                                          
##   Soil_Type21        Soil_Type22       Soil_Type23       Soil_Type24     
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.001323   Mean   :0.02353   Mean   :0.05102   Mean   :0.01635  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                          
##   Soil_Type25        Soil_Type26        Soil_Type27       
##  Min.   :0.00e+00   Min.   :0.000000   Min.   :0.0000000  
##  1st Qu.:0.00e+00   1st Qu.:0.000000   1st Qu.:0.0000000  
##  Median :0.00e+00   Median :0.000000   Median :0.0000000  
##  Mean   :9.45e-05   Mean   :0.003401   Mean   :0.0008503  
##  3rd Qu.:0.00e+00   3rd Qu.:0.000000   3rd Qu.:0.0000000  
##  Max.   :1.00e+00   Max.   :1.000000   Max.   :1.0000000  
##                                                           
##   Soil_Type28         Soil_Type29       Soil_Type30       Soil_Type31     
##  Min.   :0.0000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.0002834   Mean   :0.08626   Mean   :0.04639   Mean   :0.02173  
##  3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                           
##   Soil_Type32       Soil_Type33       Soil_Type34        Soil_Type35      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.000000   Median :0.000000  
##  Mean   :0.04639   Mean   :0.04101   Mean   :0.001417   Mean   :0.007086  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.000000  
##                                                                           
##   Soil_Type36         Soil_Type37       Soil_Type38       Soil_Type39     
##  Min.   :0.0000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.0006614   Mean   :0.00189   Mean   :0.04639   Mean   :0.04403  
##  3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                           
##   Soil_Type40      targetVar
##  Min.   :0.00000   1:1512   
##  1st Qu.:0.00000   2:1512   
##  Median :0.00000   3:1512   
##  Mean   :0.03071   4:1512   
##  3rd Qu.:0.00000   5:1512   
##  Max.   :1.00000   6:1512   
##                    7:1512

2.a.v) Count missing values.

sapply(Xy_train, function(x) sum(is.na(x)))
##                          Elevation                             Aspect 
##                                  0                                  0 
##                              Slope   Horizontal_Distance_To_Hydrology 
##                                  0                                  0 
##     Vertical_Distance_To_Hydrology    Horizontal_Distance_To_Roadways 
##                                  0                                  0 
##                      Hillshade_9am                     Hillshade_Noon 
##                                  0                                  0 
##                      Hillshade_3pm Horizontal_Distance_To_Fire_Points 
##                                  0                                  0 
##                   Wilderness_Area1                   Wilderness_Area2 
##                                  0                                  0 
##                   Wilderness_Area3                   Wilderness_Area4 
##                                  0                                  0 
##                         Soil_Type1                         Soil_Type2 
##                                  0                                  0 
##                         Soil_Type3                         Soil_Type4 
##                                  0                                  0 
##                         Soil_Type5                         Soil_Type6 
##                                  0                                  0 
##                         Soil_Type7                         Soil_Type8 
##                                  0                                  0 
##                         Soil_Type9                        Soil_Type10 
##                                  0                                  0 
##                        Soil_Type11                        Soil_Type12 
##                                  0                                  0 
##                        Soil_Type13                        Soil_Type14 
##                                  0                                  0 
##                        Soil_Type15                        Soil_Type16 
##                                  0                                  0 
##                        Soil_Type17                        Soil_Type18 
##                                  0                                  0 
##                        Soil_Type19                        Soil_Type20 
##                                  0                                  0 
##                        Soil_Type21                        Soil_Type22 
##                                  0                                  0 
##                        Soil_Type23                        Soil_Type24 
##                                  0                                  0 
##                        Soil_Type25                        Soil_Type26 
##                                  0                                  0 
##                        Soil_Type27                        Soil_Type28 
##                                  0                                  0 
##                        Soil_Type29                        Soil_Type30 
##                                  0                                  0 
##                        Soil_Type31                        Soil_Type32 
##                                  0                                  0 
##                        Soil_Type33                        Soil_Type34 
##                                  0                                  0 
##                        Soil_Type35                        Soil_Type36 
##                                  0                                  0 
##                        Soil_Type37                        Soil_Type38 
##                                  0                                  0 
##                        Soil_Type39                        Soil_Type40 
##                                  0                                  0 
##                          targetVar 
##                                  0

2.a.vi) Summarize the levels of the class attribute.

cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##   freq percentage
## 1 1512   14.28571
## 2 1512   14.28571
## 3 1512   14.28571
## 4 1512   14.28571
## 5 1512   14.28571
## 6 1512   14.28571
## 7 1512   14.28571
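
Because the seven classes are perfectly balanced in the training set, a model that always predicts the same class would be right one time in seven. The short sketch below works out that floor from the frequencies shown above, which gives a reference point for reading the accuracy figures later on. The variable names are illustrative assumptions.

```r
# Class frequencies copied from the table above
class_counts <- c("1" = 1512, "2" = 1512, "3" = 1512, "4" = 1512,
                  "5" = 1512, "6" = 1512, "7" = 1512)

# Accuracy of always guessing the most frequent class
no_info_rate <- max(class_counts) / sum(class_counts)
cat("Baseline accuracy from guessing one class:", round(no_info_rate * 100, 2), "%\n")
```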

2.b) Data visualizations

2.b.i) Univariate plots to better understand each attribute.

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(X_train[,i], main=names(X_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(X_train[,i], main=names(X_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(X_train[,i]), main=names(X_train)[i])
}

2.b.ii) Multivariate plots to better understand the relationships between attributes

# Scatterplot matrix colored by class
# pairs(targetVar~., data=Xy_train, col=Xy_train$targetVar)
# Box and whisker plots for each attribute by class
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# featurePlot(x=X_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
# featurePlot(x=X_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(X_train)
## Warning in cor(X_train): the standard deviation is zero
corrplot(correlations, method="circle")

if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))
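
The warning from cor() arises because at least one attribute is constant in the training set (the summary above shows Soil_Type7 and Soil_Type15 with a maximum of 0), so its correlation with anything is undefined. One possible way to avoid the warning, shown here as a sketch on made-up data rather than as the approach used in this project, is to drop zero-variance columns before computing the matrix. The data frame `toy` and its column names are hypothetical.

```r
# Toy data frame with one constant column (made-up values)
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(2, 1, 4, 3),
                  const = c(0, 0, 0, 0))

# Keep only the columns whose standard deviation is greater than zero
col_sd <- sapply(toy, sd)
keep <- col_sd > 0
correlations_toy <- cor(toy[, keep, drop = FALSE])
print(correlations_toy)
```

The resulting matrix contains no NA entries, which also keeps the correlation plot free of blank rows and columns.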

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Data-prep tasks might include data transforms, splitting the data into training and testing sets, and feature selection.

if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))

3.a) Data Transforms

# Not applicable for this iteration of the project.

3.b) Splitting Data into Training and Testing Sets

# Not applicable for this iteration of the project.

3.c) Feature Selection

# Perform the Recursive Feature Elimination (RFE) technique
startTimeModule <- proc.time()
set.seed(seedNum)
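# Separate the attributes from the target, then center and scale the attributes (the preProcess defaults)
X_rfe <- Xy_train[,1:totAttr]
y_rfe <- Xy_train[,totCol]
normalization <- preProcess(X_rfe)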
## Warning in preProcess.default(X_rfe): These variables have zero variances:
## Soil_Type7, Soil_Type15
X_rfe <- predict(normalization, X_rfe)
X_rfe <- as.data.frame(X_rfe)
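# Use the Random Forest helper functions with 10-fold cross validation for the RFE resampling
rfeCTRL <- rfeControl(functions=rfFuncs, method="cv", number=10, repeats=1, verbose=FALSE, returnResamp="all")
# Evaluate every subset size from 2 up to optimalVars attributes
optimalVars <- 50
subsets <- c(2:optimalVars)
rfeProfile <- rfe(X_rfe, y_rfe, sizes=subsets, rfeControl=rfeCTRL)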
print(rfeProfile)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD  KappaSD Selected
##          2   0.6002 0.5335   0.014390 0.016786         
##          3   0.6917 0.6403   0.019110 0.022293         
##          4   0.7901 0.7551   0.015449 0.018020         
##          5   0.8093 0.7776   0.013989 0.016317         
##          6   0.8206 0.7907   0.017082 0.019927         
##          7   0.8350 0.8075   0.011035 0.012872         
##          8   0.8298 0.8015   0.011884 0.013863         
##          9   0.8280 0.7994   0.011159 0.013016         
##         10   0.8300 0.8017   0.008177 0.009537         
##         11   0.8402 0.8136   0.006374 0.007436         
##         12   0.8419 0.8156   0.011047 0.012885        *
##         13   0.8403 0.8137   0.010572 0.012331         
##         14   0.8359 0.8085   0.011841 0.013814         
##         15   0.8332 0.8054   0.009684 0.011296         
##         16   0.8415 0.8150   0.010639 0.012409         
##         17   0.8401 0.8135   0.012045 0.014050         
##         18   0.8411 0.8146   0.009546 0.011134         
##         19   0.8383 0.8114   0.011660 0.013601         
##         20   0.8352 0.8078   0.009707 0.011323         
##         21   0.8343 0.8067   0.010848 0.012654         
##         22   0.8348 0.8073   0.009347 0.010902         
##         23   0.8313 0.8031   0.008934 0.010421         
##         24   0.8297 0.8014   0.011165 0.013024         
##         25   0.8411 0.8146   0.010265 0.011974         
##         26   0.8412 0.8147   0.011118 0.012969         
##         27   0.8403 0.8137   0.011142 0.012998         
##         28   0.8387 0.8118   0.010594 0.012359         
##         29   0.8346 0.8070   0.013208 0.015408         
##         30   0.8342 0.8066   0.011277 0.013155         
##         31   0.8325 0.8046   0.011607 0.013540         
##         32   0.8305 0.8023   0.012344 0.014400         
##         33   0.8291 0.8006   0.012798 0.014930         
##         34   0.8267 0.7978   0.012917 0.015069         
##         35   0.8251 0.7960   0.013152 0.015343         
##         36   0.8365 0.8092   0.013346 0.015569         
##         37   0.8364 0.8091   0.015186 0.017716         
##         38   0.8335 0.8058   0.013125 0.015311         
##         39   0.8325 0.8046   0.012022 0.014023         
##         40   0.8306 0.8024   0.011440 0.013345         
##         41   0.8302 0.8019   0.013796 0.016093         
##         42   0.8287 0.8002   0.013907 0.016223         
##         43   0.8250 0.7959   0.010943 0.012765         
##         44   0.8253 0.7962   0.013644 0.015916         
##         45   0.8231 0.7937   0.012249 0.014287         
##         46   0.8220 0.7923   0.013252 0.015458         
##         47   0.8185 0.7883   0.010991 0.012820         
##         48   0.8141 0.7831   0.011523 0.013441         
##         49   0.8292 0.8007   0.015338 0.017892         
##         50   0.8265 0.7976   0.012120 0.014137         
##         54   0.8180 0.7877   0.012557 0.014647         
## 
## The top 5 variables (out of 12):
##    Elevation, Horizontal_Distance_To_Roadways, Horizontal_Distance_To_Hydrology, Horizontal_Distance_To_Fire_Points, Vertical_Distance_To_Hydrology
plot(rfeProfile, type=c("g", "o"))
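
The resampling table shows accuracy levelling off early: the best subset size (12) is only marginally better than several smaller ones. As a rough illustration, the sketch below re-enters the Variables and Accuracy columns from the table and finds the smallest subset whose accuracy is within a chosen tolerance of the best. The 1% tolerance is an assumption made for this example, not a setting used in the project.

```r
# Variables and Accuracy columns copied from print(rfeProfile).
sizes <- c(2:50, 54)
accuracy <- c(
  0.6002, 0.6917, 0.7901, 0.8093, 0.8206, 0.8350, 0.8298, 0.8280, 0.8300, 0.8402,
  0.8419, 0.8403, 0.8359, 0.8332, 0.8415, 0.8401, 0.8411, 0.8383, 0.8352, 0.8343,
  0.8348, 0.8313, 0.8297, 0.8411, 0.8412, 0.8403, 0.8387, 0.8346, 0.8342, 0.8325,
  0.8305, 0.8291, 0.8267, 0.8251, 0.8365, 0.8364, 0.8335, 0.8325, 0.8306, 0.8302,
  0.8287, 0.8250, 0.8253, 0.8231, 0.8220, 0.8185, 0.8141, 0.8292, 0.8265, 0.8180
)

# Subset size with the highest cross-validated accuracy.
best_size <- sizes[which.max(accuracy)]

# Smallest subset whose accuracy is within 1% (relative) of the best.
tolerance <- 0.01   # assumed for illustration
threshold <- max(accuracy) * (1 - tolerance)
tolerant_size <- min(sizes[accuracy >= threshold])

cat("Best subset size:", best_size, "\n")
cat("Smallest size within tolerance:", tolerant_size, "\n")
```

With these figures the sketch reports 12 as the best size and 7 as the smallest size within 1% of it. caret also provides pickSizeTolerance(), which applies a similar rule to the resampling results of an rfe object.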

# Keep the attributes chosen by RFE, capped at optimalVars
numberRFEVars <- length(predictors(rfeProfile))
if (numberRFEVars <= optimalVars) {
  rfeAttributes <- predictors(rfeProfile)
} else {
  newProfile <- update(rfeProfile, x=X_rfe, y=y_rfe, size=optimalVars)
  rfeAttributes <- newProfile$bestVar
}
cat('Number of attributes selected from the RFE algorithm:',length(rfeAttributes),'\n')
## Number of attributes selected from the RFE algorithm: 12
print(rfeAttributes)
##  [1] "Elevation"                         
##  [2] "Horizontal_Distance_To_Roadways"   
##  [3] "Horizontal_Distance_To_Hydrology"  
##  [4] "Horizontal_Distance_To_Fire_Points"
##  [5] "Vertical_Distance_To_Hydrology"    
##  [6] "Hillshade_Noon"                    
##  [7] "Hillshade_9am"                     
##  [8] "Hillshade_3pm"                     
##  [9] "Aspect"                            
## [10] "Wilderness_Area4"                  
## [11] "Soil_Type10"                       
## [12] "Wilderness_Area1"
# Removing the unselected attributes from the training and validation dataframes
rfeAttributes <- c(rfeAttributes,"targetVar")
Xy_train <- Xy_train[, (names(Xy_train) %in% rfeAttributes)]
Xy_test <- Xy_test[, (names(Xy_test) %in% rfeAttributes)]

3.d) Display the Final Dataset for Model-Building

dim(Xy_train)
## [1] 10584    13
dim(Xy_test)
## [1] 4536   13
sapply(Xy_train, class)
##                          Elevation                             Aspect 
##                          "integer"                          "integer" 
##   Horizontal_Distance_To_Hydrology     Vertical_Distance_To_Hydrology 
##                          "integer"                          "integer" 
##    Horizontal_Distance_To_Roadways                      Hillshade_9am 
##                          "integer"                          "integer" 
##                     Hillshade_Noon                      Hillshade_3pm 
##                          "integer"                          "integer" 
## Horizontal_Distance_To_Fire_Points                   Wilderness_Area1 
##                          "integer"                          "integer" 
##                   Wilderness_Area4                        Soil_Type10 
##                          "integer"                          "integer" 
##                          targetVar 
##                           "factor"
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
proc.time()-startTimeScript
##      user    system   elapsed 
## 14281.941    36.026 14345.447

4. Model and Evaluate Algorithms

After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include:

For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:

Linear Algorithm: Linear Discriminant Analysis

Non-Linear Algorithm: Decision Trees (CART)

Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting

The random number seed is reset before each run so that every algorithm is evaluated on the same data splits, which makes the results directly comparable.
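
The effect of resetting the seed can be seen with a simplified base-R sketch. It assumes a hypothetical seed value (seedNum is defined earlier in the script) and uses plain random fold assignment as a stand-in for caret's internal resampling, which is more elaborate.

```r
# Hypothetical seed; the script's real seedNum is set elsewhere.
seed_value <- 888
n_rows <- 10584   # training rows reported by dim(Xy_train)
n_folds <- 10

# Assign each row to one of n_folds folds after fixing the seed.
assign_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

folds_first_run  <- assign_folds(n_rows, n_folds, seed_value)
folds_second_run <- assign_folds(n_rows, n_folds, seed_value)

# Same seed, same splits: two models fitted after these calls see identical folds.
print(identical(folds_first_run, folds_second_run))
print(table(folds_first_run))
```

Without the set.seed() call inside the function, the two runs would almost certainly produce different fold assignments, and differences in accuracy could then come from the splits rather than from the algorithms.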

4.a) Generate models using linear algorithms

startModeling <- proc.time()
# Linear Discriminant Analysis (Classification)
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling has begun!",date()))
# startTimeModule <- proc.time()
# set.seed(seedNum)
# fit.lda <- train(targetVar~., data=Xy_train, method="lda", metric=metricTarget, trControl=control)
# print(fit.lda)
# proc.time()-startTimeModule
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling completed!",date()))

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 10584 samples
##    12 predictor
##     7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.09656085  0.4616631  0.37194057
##   0.13966049  0.3336852  0.22266751
##   0.16666667  0.2140159  0.08322312
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09656085.
proc.time()-startTimeModule
##    user  system elapsed 
##   2.478   0.856   2.387
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 10584 samples
##    12 predictor
##     7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8234126  0.7939812
proc.time()-startTimeModule
##    user  system elapsed 
##  43.964  24.746  39.495
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 10584 samples
##    12 predictor
##     7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8197284  0.7896826
##    7    0.8417432  0.8153668
##   12    0.8331438  0.8053340
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
proc.time()-startTimeModule
##    user  system elapsed 
## 307.654   2.061 310.372
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=FALSE)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=FALSE)
print(fit.gbm)
## eXtreme Gradient Boosting 
## 
## 10584 samples
##    12 predictor
##     7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy   Kappa    
##   0.3  1          0.6               0.50        50      0.6792317  0.6257675
##   0.3  1          0.6               0.50       100      0.7031374  0.6536585
##   0.3  1          0.6               0.50       150      0.7155177  0.6681022
##   0.3  1          0.6               0.75        50      0.6783846  0.6247807
##   0.3  1          0.6               0.75       100      0.7044618  0.6552040
##   0.3  1          0.6               0.75       150      0.7134367  0.6656750
##   0.3  1          0.6               1.00        50      0.6750755  0.6209193
##   0.3  1          0.6               1.00       100      0.7007749  0.6509023
##   0.3  1          0.6               1.00       150      0.7112663  0.6631423
##   0.3  1          0.8               0.50        50      0.6748842  0.6206966
##   0.3  1          0.8               0.50       100      0.7074851  0.6587305
##   0.3  1          0.8               0.50       150      0.7139084  0.6662256
##   0.3  1          0.8               0.75        50      0.6776288  0.6238983
##   0.3  1          0.8               0.75       100      0.7052192  0.6560874
##   0.3  1          0.8               0.75       150      0.7172182  0.6700862
##   0.3  1          0.8               1.00        50      0.6740368  0.6197082
##   0.3  1          0.8               1.00       100      0.6992645  0.6491399
##   0.3  1          0.8               1.00       150      0.7114555  0.6633636
##   0.3  2          0.6               0.50        50      0.7227910  0.6765881
##   0.3  2          0.6               0.50       100      0.7505667  0.7089933
##   0.3  2          0.6               0.50       150      0.7624741  0.7228859
##   0.3  2          0.6               0.75        50      0.7249628  0.6791216
##   0.3  2          0.6               0.75       100      0.7521732  0.7108676
##   0.3  2          0.6               0.75       150      0.7641719  0.7248666
##   0.3  2          0.6               1.00        50      0.7195780  0.6728391
##   0.3  2          0.6               1.00       100      0.7450870  0.7026003
##   0.3  2          0.6               1.00       150      0.7621876  0.7225518
##   0.3  2          0.8               0.50        50      0.7293085  0.6841916
##   0.3  2          0.8               0.50       100      0.7541579  0.7131834
##   0.3  2          0.8               0.50       150      0.7675740  0.7288357
##   0.3  2          0.8               0.75        50      0.7293097  0.6841934
##   0.3  2          0.8               0.75       100      0.7542545  0.7132962
##   0.3  2          0.8               0.75       150      0.7663507  0.7274086
##   0.3  2          0.8               1.00        50      0.7215624  0.6751541
##   0.3  2          0.8               1.00       100      0.7486817  0.7067936
##   0.3  2          0.8               1.00       150      0.7634192  0.7239886
##   0.3  3          0.6               0.50        50      0.7569923  0.7164906
##   0.3  3          0.6               0.50       100      0.7805194  0.7439382
##   0.3  3          0.6               0.50       150      0.7884566  0.7531987
##   0.3  3          0.6               0.75        50      0.7529289  0.7117489
##   0.3  3          0.6               0.75       100      0.7825943  0.7463590
##   0.3  3          0.6               0.75       150      0.7962009  0.7622335
##   0.3  3          0.6               1.00        50      0.7481107  0.7061280
##   0.3  3          0.6               1.00       100      0.7808001  0.7442661
##   0.3  3          0.6               1.00       150      0.7917573  0.7570498
##   0.3  3          0.8               0.50        50      0.7599218  0.7199080
##   0.3  3          0.8               0.50       100      0.7876036  0.7522032
##   0.3  3          0.8               0.50       150      0.7960136  0.7620152
##   0.3  3          0.8               0.75        50      0.7609587  0.7211180
##   0.3  3          0.8               0.75       100      0.7864707  0.7508826
##   0.3  3          0.8               0.75       150      0.7979948  0.7643271
##   0.3  3          0.8               1.00        50      0.7548173  0.7139521
##   0.3  3          0.8               1.00       100      0.7816530  0.7452608
##   0.3  3          0.8               1.00       150      0.7951616  0.7610208
##   0.4  1          0.6               0.50        50      0.6933133  0.6421964
##   0.4  1          0.6               0.50       100      0.7162725  0.6689826
##   0.4  1          0.6               0.50       150      0.7200517  0.6733922
##   0.4  1          0.6               0.75        50      0.6862229  0.6339247
##   0.4  1          0.6               0.75       100      0.7118321  0.6638023
##   0.4  1          0.6               0.75       150      0.7206166  0.6740512
##   0.4  1          0.6               1.00        50      0.6858490  0.6334885
##   0.4  1          0.6               1.00       100      0.7088098  0.6602767
##   0.4  1          0.6               1.00       150      0.7176920  0.6706393
##   0.4  1          0.8               0.50        50      0.6931219  0.6419732
##   0.4  1          0.8               0.50       100      0.7142885  0.6666681
##   0.4  1          0.8               0.50       150      0.7207128  0.6741626
##   0.4  1          0.8               0.75        50      0.6907621  0.6392202
##   0.4  1          0.8               0.75       100      0.7122101  0.6642433
##   0.4  1          0.8               0.75       150      0.7212791  0.6748237
##   0.4  1          0.8               1.00        50      0.6866038  0.6343689
##   0.4  1          0.8               1.00       100      0.7071103  0.6582934
##   0.4  1          0.8               1.00       150      0.7152362  0.6677740
##   0.4  2          0.6               0.50        50      0.7380012  0.6943337
##   0.4  2          0.6               0.50       100      0.7591661  0.7190265
##   0.4  2          0.6               0.50       150      0.7715436  0.7334676
##   0.4  2          0.6               0.75        50      0.7298768  0.6848547
##   0.4  2          0.6               0.75       100      0.7584095  0.7181440
##   0.4  2          0.6               0.75       150      0.7716374  0.7335766
##   0.4  2          0.6               1.00        50      0.7316727  0.6869499
##   0.4  2          0.6               1.00       100      0.7572761  0.7168210
##   0.4  2          0.6               1.00       150      0.7707905  0.7325876
##   0.4  2          0.8               0.50        50      0.7399869  0.6966499
##   0.4  2          0.8               0.50       100      0.7600167  0.7200188
##   0.4  2          0.8               0.50       150      0.7698431  0.7314833
##   0.4  2          0.8               0.75        50      0.7413090  0.6981932
##   0.4  2          0.8               0.75       100      0.7660663  0.7270770
##   0.4  2          0.8               0.75       150      0.7745682  0.7369957
##   0.4  2          0.8               1.00        50      0.7341281  0.6898147
##   0.4  2          0.8               1.00       100      0.7576565  0.7172652
##   0.4  2          0.8               1.00       150      0.7709773  0.7328064
##   0.4  3          0.6               0.50        50      0.7641744  0.7248696
##   0.4  3          0.6               0.50       100      0.7830730  0.7469179
##   0.4  3          0.6               0.50       150      0.7910073  0.7561747
##   0.4  3          0.6               0.75        50      0.7663451  0.7274021
##   0.4  3          0.6               0.75       100      0.7900626  0.7550727
##   0.4  3          0.6               0.75       150      0.7994181  0.7659877
##   0.4  3          0.6               1.00        50      0.7627546  0.7232124
##   0.4  3          0.6               1.00       100      0.7876052  0.7522058
##   0.4  3          0.6               1.00       150      0.7954459  0.7613534
##   0.4  3          0.8               0.50        50      0.7703129  0.7320314
##   0.4  3          0.8               0.50       100      0.7898706  0.7548490
##   0.4  3          0.8               0.50       150      0.7976186  0.7638886
##   0.4  3          0.8               0.75        50      0.7729615  0.7351212
##   0.4  3          0.8               0.75       100      0.7938395  0.7594791
##   0.4  3          0.8               0.75       150      0.8031933  0.7703924
##   0.4  3          0.8               1.00        50      0.7668207  0.7279559
##   0.4  3          0.8               1.00       100      0.7921388  0.7574943
##   0.4  3          0.8               1.00       150      0.8003606  0.7670868
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
##  eta = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
##  and subsample = 0.75.
proc.time()-startTimeModule
##     user   system  elapsed 
## 1660.538   22.384  852.271
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))

4.d) Compare baseline algorithms

results <- resamples(list(CART=fit.cart, BDT=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: CART, BDT, RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART 0.4018868 0.4288752 0.4806238 0.4616631 0.4890160 0.4962193    0
## BDT  0.8119093 0.8180319 0.8219193 0.8234126 0.8270321 0.8421550    0
## RF   0.8300283 0.8394418 0.8408870 0.8417432 0.8455380 0.8525520    0
## GBM  0.7892250 0.7996692 0.8007561 0.8031933 0.8072835 0.8204159    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART 0.3020993 0.3337927 0.3938647 0.3719406 0.4038891 0.4121922    0
## BDT  0.7805586 0.7877062 0.7922423 0.7939812 0.7982042 0.8158487    0
## RF   0.8017071 0.8126812 0.8143637 0.8153668 0.8197955 0.8279791    0
## GBM  0.7540940 0.7662846 0.7675493 0.7703924 0.7751639 0.7904835    0
dotplot(results)

cat('The average accuracy from all models is:',
    mean(c(results$values$`CART~Accuracy`,results$values$`BDT~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)),'\n')
## The average accuracy from all models is: 0.7325031
cat('Total training time for all models:',proc.time()-startModeling)
## Total training time for all models: 2015.145 50.048 1205.037 0 0
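
Two details in the chunk above are worth spelling out. Because every model was evaluated on the same 10 resamples, the pooled average equals the mean of the four per-model means from summary(results). Also, subtracting two proc.time() values yields a five-element vector, which is why cat() prints five unlabeled numbers. The base-R sketch below checks the first point with the figures from the table and shows one way to label the timing output. Sys.sleep() is only a placeholder for real work.

```r
# Per-model mean accuracy copied from summary(results).
mean_accuracy <- c(CART = 0.4616631, BDT = 0.8234126,
                   RF = 0.8417432, GBM = 0.8031933)
overall_mean <- mean(mean_accuracy)
cat("Average accuracy across models:", overall_mean, "\n")

# Time a placeholder task and report the components by name.
start_time <- proc.time()
Sys.sleep(0.1)   # stand-in for model training
timing <- proc.time() - start_time
cat("user:", timing[["user.self"]],
    "system:", timing[["sys.self"]],
    "elapsed:", timing[["elapsed"]], "\n")
```

The recomputed average agrees with the reported 0.7325031 up to rounding of the inputs. Note that CART, at roughly 46% accuracy, pulls this average well below the three ensemble results.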

5. Improve Accuracy or Results

After we achieve a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve the accuracy of the models.

Using the two best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Bagged CART
# No tuning parameters available for "treebag" in the caret package
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.final1 <- fit.bagcart
print(fit.final1)
## Bagged CART 
## 
## 10584 samples
##    12 predictor
##     7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8234126  0.7939812
proc.time()-startTimeModule
##    user  system elapsed 
##   0.006   0.001   0.008
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
# Tuning algorithm #2 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(2,5,7,10,12))
fit.final2 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final2)

print(fit.final2)
## Random Forest 
## 
## 10584 samples
##    12 predictor
##     7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8178379  0.7874769
##    5    0.8424034  0.8161366
##    7    0.8394758  0.8127214
##   10    0.8361681  0.8088626
##   12    0.8321061  0.8041235
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
proc.time()-startTimeModule
##    user  system elapsed 
## 557.843   3.669 562.572
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))
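
To put the tuning result in context, the sketch below re-enters the mtry grid results printed above and compares the best tuned accuracy with the 0.8417432 that the default grid produced for Random Forest in section 4.c. It is a plain base-R illustration and does not depend on the fitted objects.

```r
# mtry and Accuracy columns copied from print(fit.final2).
tuning_results <- data.frame(
  mtry     = c(2, 5, 7, 10, 12),
  Accuracy = c(0.8178379, 0.8424034, 0.8394758, 0.8361681, 0.8321061)
)

# Row with the highest cross-validated accuracy.
best_row <- tuning_results[which.max(tuning_results$Accuracy), ]

# Random Forest accuracy from the default grid in section 4.c (mtry = 7).
baseline_accuracy <- 0.8417432
gain <- best_row$Accuracy - baseline_accuracy

cat("Best mtry:", best_row$mtry, "\n")
cat("Gain over the default grid:", round(gain * 100, 3), "percentage points\n")
```

The gain is well under a tenth of a percentage point, which is small next to the fold-to-fold spread shown in summary(results), so the tuned and untuned Random Forest models are effectively on par.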

5.d) Compare Algorithms After Tuning

results <- resamples(list(BDT=fit.final1, RF=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: BDT, RF 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## BDT 0.8119093 0.8180319 0.8219193 0.8234126 0.8270321 0.8421550    0
## RF  0.8344371 0.8379017 0.8413605 0.8424034 0.8445010 0.8563327    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## BDT 0.7805586 0.7877062 0.7922423 0.7939812 0.7982042 0.8158487    0
## RF  0.8068433 0.8108822 0.8149208 0.8161366 0.8185794 0.8323887    0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrowed the candidates down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:

if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))

6.a) Predictions on validation dataset

predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5   6   7
##          1 452 126   0   0   0   0  33
##          2 116 421   7   0  15   1   2
##          3   2  20 516  19   9  86   0
##          4   0   0  34 622   0  24   0
##          5  15  57  10   0 612   7   1
##          6   3  18  81   7  12 530   0
##          7  60   6   0   0   0   0 612
## 
## Overall Statistics
##                                           
##                Accuracy : 0.83            
##                  95% CI : (0.8188, 0.8409)
##     No Information Rate : 0.1429          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8017          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.69753  0.64969   0.7963   0.9599   0.9444   0.8179
## Specificity           0.95910  0.96373   0.9650   0.9851   0.9769   0.9689
## Pos Pred Value        0.73977  0.74911   0.7914   0.9147   0.8718   0.8141
## Neg Pred Value        0.95006  0.94288   0.9660   0.9933   0.9906   0.9696
## Prevalence            0.14286  0.14286   0.1429   0.1429   0.1429   0.1429
## Detection Rate        0.09965  0.09281   0.1138   0.1371   0.1349   0.1168
## Detection Prevalence  0.13470  0.12390   0.1437   0.1499   0.1548   0.1435
## Balanced Accuracy     0.82832  0.80671   0.8807   0.9725   0.9606   0.8934
##                      Class: 7
## Sensitivity            0.9444
## Specificity            0.9830
## Pos Pred Value         0.9027
## Neg Pred Value         0.9907
## Prevalence             0.1429
## Detection Rate         0.1349
## Detection Prevalence   0.1495
## Balanced Accuracy      0.9637
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5   6   7
##          1 469 103   0   0   0   0  25
##          2 108 449   9   0  10   5   0
##          3   0  18 522  18   7  70   0
##          4   0   0  35 624   0  26   0
##          5  17  53   6   0 619   7   1
##          6   2  19  76   6  12 540   0
##          7  52   6   0   0   0   0 622
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8477         
##                  95% CI : (0.8369, 0.858)
##     No Information Rate : 0.1429         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8223         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.7238  0.69290   0.8056   0.9630   0.9552   0.8333
## Specificity            0.9671  0.96605   0.9709   0.9843   0.9784   0.9704
## Pos Pred Value         0.7856  0.77281   0.8220   0.9109   0.8805   0.8244
## Neg Pred Value         0.9546  0.94968   0.9677   0.9938   0.9924   0.9722
## Prevalence             0.1429  0.14286   0.1429   0.1429   0.1429   0.1429
## Detection Rate         0.1034  0.09899   0.1151   0.1376   0.1365   0.1190
## Detection Prevalence   0.1316  0.12809   0.1400   0.1510   0.1550   0.1444
## Balanced Accuracy      0.8454  0.82948   0.8882   0.9736   0.9668   0.9019
##                      Class: 7
## Sensitivity            0.9599
## Specificity            0.9851
## Pos Pred Value         0.9147
## Neg Pred Value         0.9933
## Prevalence             0.1429
## Detection Rate         0.1371
## Detection Prevalence   0.1499
## Balanced Accuracy      0.9725
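
The headline statistics in the confusionMatrix() output can be reproduced by hand from the counts. The base-R sketch below re-enters the Random Forest matrix printed above and recomputes overall accuracy, per-class sensitivity and positive predictive value, and Cohen's kappa. It is meant as a check on how the figures are derived, not as a replacement for caret's function.

```r
# Random Forest confusion matrix: rows are predictions, columns are references.
class_labels <- as.character(1:7)
cm <- matrix(c(
  469, 103,   0,   0,   0,   0,  25,
  108, 449,   9,   0,  10,   5,   0,
    0,  18, 522,  18,   7,  70,   0,
    0,   0,  35, 624,   0,  26,   0,
   17,  53,   6,   0, 619,   7,   1,
    2,  19,  76,   6,  12, 540,   0,
   52,   6,   0,   0,   0,   0, 622
), nrow = 7, byrow = TRUE,
   dimnames = list(Prediction = class_labels, Reference = class_labels))

n <- sum(cm)

# Overall accuracy: correct predictions sit on the diagonal.
accuracy <- sum(diag(cm)) / n

# Sensitivity: correct / actual members of each class (column totals).
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct / predicted members (row totals).
pos_pred_value <- diag(cm) / rowSums(cm)

# Cohen's kappa: agreement beyond what the marginals would give by chance.
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(Accuracy = accuracy, Kappa = cohen_kappa), 4))
print(round(rbind(Sensitivity = sensitivity, PosPredValue = pos_pred_value), 4))
```

Classes 1 and 2 have the lowest sensitivity in both models, and the off-diagonal counts show that most of their errors are confusions between those two classes.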

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(seedNum)

# Combining the training and test datasets to form the original dataset that will be used for training the final model
xy_complete <- rbind(Xy_train, Xy_test)

# finalModel <- randomForest(targetVar~., xy_complete, mtry=5, na.action=na.omit)
# summary(finalModel)
proc.time()-startTimeModule
##    user  system elapsed 
##   0.019   0.000   0.020

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_MultiClass.rds")
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
proc.time()-startTimeScript
##      user    system   elapsed 
## 16857.372    89.837 16115.500
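
Since the save step is commented out, here is a small round-trip sketch of saveRDS() and readRDS(). It assumes a stand-in model (a linear regression on R's built-in cars dataset) and a temporary file path, neither of which comes from the project. The same pattern would apply to finalModel.

```r
# Stand-in model; the project would save finalModel instead.
standin_model <- lm(dist ~ speed, data = cars)

# Write to a temporary file rather than the project's own .rds path.
model_path <- tempfile(fileext = ".rds")
saveRDS(standin_model, model_path)

# Reload the model and confirm that predictions are unchanged.
restored_model <- readRDS(model_path)
new_data <- data.frame(speed = c(10, 20))
original_pred <- predict(standin_model, newdata = new_data)
restored_pred <- predict(restored_model, newdata = new_data)
print(all.equal(original_pred, restored_pred))

# Clean up the temporary file.
unlink(model_path)
```

Checking predictions after reloading is a cheap way to confirm that the serialized object is usable before it is handed to another script.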